This tutorial follows most steps of the Moving Pictures Tutorial, which can be found on the qiime2 website (https://docs.qiime2.org/2018.6/tutorials/moving-pictures/), and integrates them with separate instructions for the ITS sequences. These scripts are based on the current release of qiime2 (2018.6), but keep in mind that the authors tend to change the tutorial page and update their commands at every new release of qiime2.
You can ask the HiPerGator staff to update qiime2 on the support page, but they seem to keep it fairly current; be sure to modify the “module load” command accordingly in all scripts if you want to use a different version. Most of the scripts are based on the previous 16S rRNA Preprocessing Pipeline.
To log in to HiPerGator, follow the procedures reported at https://help.rc.ufl.edu/doc/Getting_Started#Connecting_to_HiPerGator. On the same page there are also instructions on how to upload data to HiPerGator. However, BEFORE uploading the data, create a folder structure that will be useful for your purposes. In this case, I will show both the 16S and ITS pipelines, and then how to merge the tables.
After logging in, navigate to your working directory and create the necessary subfolders:
cd /ufrc/strauss/your_account/your_working_directory
mkdir 16S
mkdir ITS
mkdir merged
If you want to know who’s currently working on HiPerGator, or if you want to check how your job is performing, you can enter this command:
watch squeue -A strauss -O jobid:10,username,name,partition,state:10,reason:10,cores,gres:10,tres:30,timeused:10
Try to keep the sums of the allocated resources below the account limits: some batches may simply wait until the excess resources are free, while others will break the currently running batches. To exit the activity monitor press ctrl + c.
First things first: once you have your designated working folder, move your raw sequences into a raw_seqs directory created therein. Secondly, it is useful to have a single folder named scripts from which to edit and launch your bash commands. All scripts also need a logs directory, which you should create manually, so that your commands can write their logfiles there for backtracing.
NOTE: unlike qiime1, qiime2 doesn’t need (in fact, it won’t accept) uncompressed fastq files. Don’t `gunzip` them.
NOTE2: these scripts work only for already-demultiplexed paired-end sequences. If you have multiplexed sequences or single reads, you should head to the qiime2 website and adjust the parameters of all scripts accordingly.
*IMPORTANT*: before going on with the scripts, make sure there are no duplicates among the sample file names (i.e. the part of the file name that comes before `_R1_` or `_R2_`)! Rename any offending files and then proceed with the next script.
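A quick way to spot such duplicates from the shell is to strip everything from `_R1_` onward and look for repeated prefixes. A sketch, run in a scratch directory with made-up file names:

```shell
# Demo: list sample-name prefixes that occur more than once among the
# R1 files. SampleB appears twice on purpose.
mkdir -p dup_check && cd dup_check
touch SampleA_R1_001.fastq.gz SampleA_R2_001.fastq.gz \
      SampleB_R1_001.fastq.gz SampleB_R1_002.fastq.gz
ls *_R1_* | sed 's/_R1_.*//' | sort | uniq -d   # prints: SampleB
cd ..
```

An empty output means all sample names are unique and you can proceed.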
cd 16S
mkdir raw_seqs
mkdir logs
mkdir scripts
Now move your compressed sequence files into the raw_seqs directory following the instructions from the HiPerGator wiki page. Then start writing the first bash script, using
nano scripts/0_setup.sh
You will be prompted with a text editor where you should copy-paste the following:
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -A strauss
#SBATCH -J setup
#SBATCH -N 1
#SBATCH -D /ufrc/strauss/your_account/your_working_directory/16S #PATH OF YOUR WORKING FOLDER
#SBATCH -o logs/0_setup_%j.out #PATH OF LOG FILE
date;hostname;pwd
################################################################################
#
# Setup data directory and copy raw_seqs contents
#
################################################################################
# ----------------Housekeeping--------------------
rm -rf data
mkdir data
mkdir data/raw_data
# ------------------Commands----------------------
# If sequences are individual directories (if not, comment this out):
for dir in raw_seqs/*; do
    cp "$dir"/* data/raw_data/
done
# Fix names: this depends of course on how your files are named. CHECK THEM!
# Compute the new name once, then rename only if it changed (running mv
# three times on the same $file, as before, fails after the first rename).
for file in data/raw_data/*.gz; do
    new=$file
    new=${new/Strauss?-/}
    new=${new/Strauss??-/}
    new=${new/Strauss???-/}
    [ "$file" != "$new" ] && mv "$file" "$new"
done
date
Exit using ctrl + X and save with Y. Then do
sbatch scripts/0_setup.sh
Estimated time: a few minutes
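The rename loop in 0_setup.sh relies on bash’s `${var/pattern/}` substitution, where `?` matches any single character. A quick way to preview what it does (the file name below is a made-up example):

```shell
# ${var/pattern/} deletes the first match of pattern within var;
# Strauss??- matches "Strauss" + two characters + "-".
file="data/raw_data/Strauss12-SampleA_R1_001.fastq.gz"
echo "${file/Strauss??-/}"   # -> data/raw_data/SampleA_R1_001.fastq.gz
```

Testing the pattern on one file name like this, before touching the real data, is a cheap way to avoid mass-renaming mistakes.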
Now it is time to import our sequences. This script will create a qiime2 artifact containing the sequence data (of type SampleData[PairedEndSequencesWithQuality]). This script differs from the Moving Pictures Tutorial because it is tailored for paired-end demultiplexed sequences.
Qiime2 .qza artifacts are actually compressed files which contain a typical qiime1-style output file (which might be a biom table, a taxonomy table, or a fasta file) tucked away in some subfolders, accompanied by the files necessary for generating the visualization. Visualizations are the “big thing” of qiime2: they provide a graphic interface and let you view the data graphically through interactive plots and tables.
The script produces a visualization which summarizes the quality data and the lengths of your reads (i.e. similar to the stats.txt produced by the USEARCH pipeline commands). This can be viewed by loading it on the https://view.qiime2.org website.
NOTE: HiPerGator does not have a visual interface, at least on the strauss partition. Therefore, when the scripts look for matplotlib to build the visualization file, they are likely to return an error.
To avoid that, simply create a one-line matplotlib config file with
nano ~/.config/matplotlib/matplotlibrc
You will then be prompted with an empty page in which you should write this line only:
backend : agg
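If you prefer not to edit the file interactively, the same one-line config can be written in one shot (note: this overwrites any existing matplotlibrc you may have):

```shell
# Force matplotlib to use the non-GUI Agg backend on headless nodes.
mkdir -p ~/.config/matplotlib
printf 'backend : agg\n' > ~/.config/matplotlib/matplotlibrc
```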
Exit using ctrl + X and save with Y. Then do
nano scripts/1_import.sh
And copypaste the following
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -A strauss
#SBATCH -J q2_import
#SBATCH --time=2:00:00 #Maximum number of hours
#SBATCH -N 1 #Number of nodes
#SBATCH --mem=10g #Allocated RAM
#SBATCH -D /ufrc/strauss/your_account/your_working_directory/16S #PATH OF YOUR WORKING FOLDER
#SBATCH -o logs/1_q2_import_%j.out #PATH OF LOG FILE
date;hostname;pwd
################################################################################
#
# Import fastq.gz files into a .qza file
#
################################################################################
# ----------------Load Modules--------------------
module load qiime2
# ----------------Housekeeping---------------------
cd data
rm -f demux*.q*
# ----------------Commands------------------------
#Import Data in qiime2 artifact
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path raw_data \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path demux-paired-end.qza
# Create Data Visualizations
# ATTENTION: if the demux summarize step gives you an "invalid DISPLAY variable" error, it's because you don't have a graphic interface.
# To solve it create a file using nano ~/.config/matplotlib/matplotlibrc
# And write the following
# backend : agg
qiime demux summarize \
--i-data demux-paired-end.qza \
--o-visualization demux-paired-end.qzv
# Unload modules:
module unload qiime2
date
Exit using ctrl + X and save with Y. Then do
sbatch scripts/1_import.sh
After the script finishes, load the demux-paired-end.qzv file in the https://view.qiime2.org website. Through an interactive box-plot you will see where the quality of your reads degrades, and from it you can determine at which length to truncate the forward and the reverse reads respectively. You will need these values for the next script. The idea is, of course, to find a compromise between retaining the highest number of reads and retaining their best quality.
Estimated time: many hours

The next step is denoising and dereplication. This script uses the DADA2 R package (https://www.nature.com/articles/nmeth.3869), which is implemented in the qiime2 pipeline.
Note: For a comparison of DADA2 amplicon sequence variants calling against the current OTU picking algorithm (2016) check https://benjjneb.github.io/dada2/SotA.html.
Accuracy: DADA2’s crucial advantage is that it uses more of the data. The DADA2 error model incorporates quality information, which is ignored by all other methods after filtering. The DADA2 error model incorporates quantitative abundances, whereas most other methods use abundance ranks if they use abundance at all. The DADA2 error model identifies the differences between sequences, e.g. A->C, whereas other methods merely count the mismatches. DADA2 can parameterize its error model from the data itself, rather than relying on previous datasets that may or may not reflect the PCR and sequencing protocols used in your study.
Performance: DADA2’s computational scaling gains come from the fact that it infers sequences exactly rather than constructing OTUs. De novo OTUs cannot be compared across samples unless all samples were pooled during OTU construction. However, exact sequences are comparable across samples, as exact sequences are consistent labels. Thus DADA2 can analyze each sample independently, resulting in linear scaling with sample number and trivial parallelization.
It is extremely accurate and lowers the fraction of false positives. In our previous comparisons against the USEARCH clustering method, DADA2 was shown to better discern the samples and resulted in much higher diversity. Technically, it relies on inference over exact sequences rather than OTU clustering, which is why it is better able to distinguish genuine genetic variation from a different species (see the website for more info).
The script will create a new folder called features and perform its analysis there, which includes denoising of the sequences, dereplication, and chimera removal, all in one step. The script uses a parallelized version of the DADA2 algorithm, splitting the work across 12 cores; the more cores, the shorter the run time. It produces a table.qza file, which contains a biom table (HDF5 format), and a rep-seqs.qza file, from which you can extract the fasta file containing the representative sequences. Both of them will be used in the following scripts.
NOTE: in order to truncate low-quality regions at the end of the sequences, the script needs you to specify the truncation length of each read (forward and reverse) in number of bases. Use the quality plots from the .qzv file produced in the previous step to determine these numbers and insert them in place of the two placeholders indicated.
nano scripts/2_dada2.sh
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -A strauss
#SBATCH -J par_dada2
#SBATCH -N 1
#SBATCH --time=24:00:00
#SBATCH --mem=40g #TOTAL RAM PER TASK
#SBATCH -n 12 #NUMBER OF CPUS PER TASK
#SBATCH -D /ufrc/strauss/your_account/your_working_directory/16S #PATH OF YOUR WORKING FOLDER
#SBATCH -o logs/2_q2_dada2_%j.out #PATH OF LOG FILE
date;hostname;pwd
################################################################################
#
# Filters paired-end sequences based on quality, merges dereplicates them
# using DADA2 algorithm. Includes chimera removal.
#
################################################################################
# ----------------Load Modules--------------------
module load qiime2
# ----------------Housekeeping--------------------
rm -r features
mkdir features
cp data/demux-paired-end.qza features/dada2input.qza
cd features
# ----------------Commands------------------------
qiime dada2 denoise-paired \
--i-demultiplexed-seqs dada2input.qza \
--p-n-threads 0 \
--p-trunc-len-f \ #INSERT NUMBER OF BASES BEFORE THE BACKSLASH AND DELETE THIS COMMENT
--p-trunc-len-r \ #INSERT NUMBER OF BASES BEFORE THE BACKSLASH AND DELETE THIS COMMENT
--o-table table.qza \
--o-denoising-stats stats.qza \
--o-representative-sequences rep-seqs.qza
module unload qiime2
date
Exit using ctrl + X and save with Y. Then do
sbatch scripts/2_dada2.sh
Estimated time: a few minutes
This script will build a feature table, similar in concept to the OTU table from qiime1. Before running this script it is highly recommended to check the mapping file, since errors in file names will be carried through to the end of the analysis. A good tool for validating qiime1 and qiime2 mapping files is the Keemei add-on for Google Sheets, which runs in Chrome (it can be added from the Chrome store, http://keemei.qiime.org). Once installed, open an empty Google spreadsheet, go to Add-ons, Keemei, and validate the mapping file for the required qiime version. The process is interactive and will guide you through duplicate values, formatting issues, etc.
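Besides Keemei, a quick command-line sanity check for duplicated sample IDs is also possible. A sketch, assuming a tab-separated mapping file whose first column holds the sample IDs (map.tsv below is a made-up example):

```shell
# Build a toy mapping file with a duplicated ID, then flag duplicates:
# skip the header row, keep column 1, and print IDs seen more than once.
printf 'sample-id\tgroup\nS1\tA\nS2\tB\nS1\tC\n' > map.tsv
tail -n +2 map.tsv | cut -f1 | sort | uniq -d   # prints: S1
```

An empty output means every sample ID in the mapping file is unique.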
Finally, the script will also produce visualizations of both the sequence file (rep-seqs.qza) and the feature table (table.qza), which will be useful in the next steps.
nano scripts/3_feature_table.sh
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -A strauss
#SBATCH -J q2_feattab
#SBATCH -N 1
#SBATCH --time=2:00:00
#SBATCH --mem=10g #TOTAL RAM PER TASK
#SBATCH -D /ufrc/strauss/your_account/your_working_directory/16S #PATH OF YOUR WORKING FOLDER
#SBATCH -o logs/3_q2_feature_table_%j.out #PATH OF LOG FILE
date;hostname;pwd
################################################################################
#
# Build feature table required for further analysis
#
################################################################################
# ----------------Load Modules--------------------
module load qiime2
# Housekeeping:
cd features
rm -f table.qzv rep-seqs.qzv reindexed-table.qza
# ----------------Commands------------------------
qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv \
--m-sample-metadata-file ../blabla #INSERT PATH TO VALIDATED MAPPING FILE
qiime feature-table tabulate-seqs \
--i-data rep-seqs.qza \
--o-visualization rep-seqs.qzv
# Unload modules:
module unload qiime2
date
Exit using ctrl + X and save with Y. Then do
sbatch scripts/3_feature_table.sh
Estimated time: hours
This script provides the taxonomic assignments for your sequences. To do so, it uses a pre-trained classifier: a naïve Bayes machine-learning model trained on a reference sequence database, which is then used to assign a taxonomy to each of your sequences. This requires high computational power, which is why the script allocates several processors across which the work is parallelized, as well as a large amount of memory. The qiime2 website provides a classifier for Greengenes 13_8 and a classifier for SILVA 119; be careful to download the full-length-sequence classifier and not the one truncated to a single primer region.
nano scripts/4_taxonomy_SILVA.sh
#!/bin/bash
# ----------------SLURM Parameters----------------
#SBATCH -A strauss
#SBATCH -J q2_scikit_tax
#SBATCH -N 1
#SBATCH --time=72:00:00
#SBATCH --mem=30g #TOTAL RAM PER TASK
#SBATCH -n 8 #NUMBER OF CPUS PER TASK
#SBATCH -D /ufrc/strauss/your_account/your_working_directory/16S #PATH OF YOUR WORKING FOLDER
#SBATCH -o logs/4_q2_tax_SILVA128_%j.out #PATH OF LOG FILE
date;hostname;pwd
################################################################################
#
# Assign taxonomy using a pre-trained scikit classifier version 0.19.1
#
################################################################################
# ----------------Load Modules--------------------
module load qiime2
# ----------------Housekeeping--------------------
cd features
rm -f taxonomy.qza taxonomy.qzv taxa-bar-plots.qzv
# ----------------Commands------------------------
qiime feature-classifier classify-sklearn \
--i-reads rep-seqs.qza \
--o-classification taxonomy.qza \
--i-classifier /ufrc/strauss/blabla #INSERT PATH TO YOUR PRETRAINED DATABASE
qiime metadata tabulate \
--m-input-file taxonomy.qza \
--o-visualization taxonomy.qzv
qiime taxa barplot \
--i-table table.qza \
--i-taxonomy taxonomy.qza \
--o-visualization taxa-bar-plots.qzv \
--m-metadata-file ../blabla #INSERT PATH TO VALIDATED MAPPING FILE
# Unload modules:
module unload qiime2
date
Exit using ctrl + X and save with Y. Then do
sbatch scripts/4_taxonomy_SILVA.sh